Note: To organize our responses, questions in the book chapters appear in bold text. We’ve done the same for each section of the Tidy Tuesday assignment prompt.

Healy Chapter 4

“Show The Right Numbers”

Libraries used in Ch. 4 exercises

library(gapminder)
library(tidyverse)
library(socviz)
library(ggrepel)

Explore gapminder data set

head(gapminder %>% 
       arrange(desc(gdpPercap))) 
## # A tibble: 6 × 6
##   country continent  year lifeExp     pop gdpPercap
##   <fct>   <fct>     <int>   <dbl>   <int>     <dbl>
## 1 Kuwait  Asia       1957    58.0  212846   113523.
## 2 Kuwait  Asia       1972    67.7  841934   109348.
## 3 Kuwait  Asia       1952    55.6  160000   108382.
## 4 Kuwait  Asia       1962    60.5  358266    95458.
## 5 Kuwait  Asia       1967    64.6  575003    80895.
## 6 Kuwait  Asia       1977    69.3 1140357    59265.

4.8 Where to Go Next

Revisit the gapminder plots at the beginning of the chapter and experiment with different ways to facet the data. Try plotting population and per capita GDP while faceting on year, or even on country. In the latter case you will get a lot of panels, and plotting them straight to the screen may take a long time. Instead, assign the plot to an object and save it as a PDF file to your figures/ folder. Experiment with the height and width of the figure.

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = pop))
p + geom_line(aes(group = country)) + facet_wrap(~year)

p + geom_point() + facet_wrap(~year)

Healy_Ch.4_4.8.1 <- p + geom_line(aes(group = country)) + facet_wrap(~country)
ggsave("Healy_4.8.1.pdf", plot = Healy_Ch.4_4.8.1, height = 8, width = 12)
# a larger canvas gives each of the ~140 country panels more room
ggsave("Healy_4.8.1_large.pdf", plot = Healy_Ch.4_4.8.1, height = 20, width = 40)
# We also generated a separate plot below, filtering down to a few European countries while keeping the same variables, for a more manageable visualization

euro_gdp_pop <- gapminder %>% 
  filter(continent == "Europe", country %in% c("France", "Italy", "Germany", "Turkey")) %>%
  ggplot(aes(x = gdpPercap, y = pop)) + geom_line() + facet_wrap(~country, ncol = 4)

euro_gdp_pop + geom_smooth(size = 1.1, method = "loess", se = FALSE) + 
  labs(x = "GDP per Capita",
       y = "Population",
       title = "GDP per Capita vs. Population in Four European Countries")

Investigate the difference between a formula written as facet_grid(sex ~ race) and one written as facet_grid(~ sex + race).

Age_Childs <-ggplot(gss_sm,mapping = aes(x = age, y = childs))
Age_Childs + 
  geom_point()+
  geom_smooth() +
  facet_grid(sex ~ race)

Age_Childs +  
  geom_point()+ 
  geom_smooth() + 
  facet_grid(~sex + race)

facet_grid(sex ~ race) creates facets breaking the data out by sex (in the rows) and race (in the columns). facet_grid(~ sex + race) creates facets breaking the data out by sex and race using columns only, combining each unique sex+race value into its own column.
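To see concretely how the column-only formula combines the two variables, here is a minimal base-R sketch (toy vectors, not the actual gss_sm columns) of the combined facet keys:

```r
# toy stand-ins for the sex and race variables in gss_sm
sex  <- factor(c("Male", "Male", "Female", "Female"))
race <- factor(c("White", "Black", "White", "Other"))

# facet_grid(~ sex + race) makes one column per observed sex+race
# combination, much like the levels of the interaction of the two factors:
levels(interaction(sex, race, drop = TRUE))
```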

Experiment to see what happens when you use facet_wrap() with more complex formulas like facet_wrap(~ sex + race) instead of facet_grid. Like facet_grid(), the facet_wrap() function can facet on two or more variables at once. But it will do it by laying the results out in a wrapped one-dimensional table instead of a fully cross-classified grid.

Age_Childs +  
  geom_point() +
  geom_smooth() + 
  facet_wrap(~sex + race)

A histogram bins observations for one variable and shows a bar with the count in each bin. We can do this for two variables at once, too. The geom_bin2d() function takes two mappings, x and y. It divides your plot into a grid and colors the bins by the count of observations in them. Try using it on the gapminder data to plot life expectancy versus per capita GDP. Like a histogram, you can vary the number or width of the bins for both x or y. Instead of saying bins = 30 or binwidth = 1, provide a number for both x and y with, for example, bins = c(20, 50). If you specify binwidth instead, you will need to pick values that are on the same scale as the variable you are mapping.

# store object 
life_expectancy_versus_per_capita_GDP <-  ggplot(data = gapminder, mapping = aes(x = lifeExp, y = gdpPercap))

# plot 
life_expectancy_versus_per_capita_GDP + geom_bin2d(bins=c(40,60)) 

Density estimates can also be drawn in two dimensions. The geom_density_2d() function draws contour lines estimating the joint distribution of two variables. Try it with the midwest data, for example, plotting percent below the poverty line (percbelowpoverty) against percent college-educated (percollege). Try it with and without a geom_point() layer.

percbelowpoverty_percollege <-  ggplot(data = midwest, mapping = aes(x = percbelowpoverty, y=percollege))
percbelowpoverty_percollege + geom_density_2d() + geom_point() + 
  labs(x = "% Below Poverty Line", y = "% Attended College", caption = 
         "Note: Demographic information of midwest counties\n from 2000 US census")  

Healy Chapter 5

“Graphs, tables, labels and notes”

Libraries used for Chapter 5 exercises

library(gapminder)
library(tidyverse)
library(socviz)

The following code chunks were developed as practice before answering the ‘Where to Next’ questions in the chapter.

rel_by_region <-  gss_sm %>%
group_by(bigregion, religion) %>%
summarize(N = n()) %>%
mutate(freq = N / sum(N),
pct = round((freq*100), 0))
rel_by_region %>% group_by(bigregion) %>% summarize(total = sum(pct))
## # A tibble: 4 × 2
##   bigregion total
##   <fct>     <dbl>
## 1 Northeast   100
## 2 Midwest     101
## 3 South       100
## 4 West        101
p <-  ggplot(rel_by_region, aes(x = bigregion, y = pct, fill = religion))
p + geom_col(position='dodge2') +
labs(x = "Region",y = "Percent", fill = "Religion") +
theme(legend.position = "top")

p + geom_col(position = "dodge2") +
labs(x = NULL, y = "Percent", fill = "Religion") +
guides(fill = FALSE) +
coord_flip() +
facet_grid(~ bigregion)

organdata %>% select(1:6) %>% sample_n(size = 10)
## # A tibble: 10 × 6
##    country       year       donors    pop pop_dens   gdp
##    <chr>         <date>      <dbl>  <int>    <dbl> <int>
##  1 Denmark       1991-01-01   11.7   5154   12.0   19126
##  2 Finland       1993-01-01   19.6   5066    1.50  17082
##  3 Canada        NA           NA       NA   NA        NA
##  4 Netherlands   1993-01-01   15.4  15290   36.8   19856
##  5 United States 1998-01-01   21   275854    2.86  31612
##  6 Canada        1992-01-01   12.6  28377    0.285 19590
##  7 Italy         1992-01-01    5.8  56859   18.9   18883
##  8 Denmark       NA           NA     5141   11.9   18285
##  9 Netherlands   2000-01-01   12.6  15926   38.3   26873
## 10 France        1996-01-01   15.1  58026   10.5   21990
ggplot(organdata)+
  aes(x=year, y=donors)+
  geom_line(aes(group=country))+
  facet_wrap(~country)

p <-  ggplot(data = organdata,
mapping = aes(x = reorder(country, donors, na.rm=TRUE),
y = donors, fill = world))
p + geom_boxplot() + labs(x=NULL) +
coord_flip() + theme(legend.position = "top")

by_country <-  organdata %>% group_by(consent_law, country) %>%
summarize_if(is.numeric, funs(mean, sd), na.rm = TRUE) %>%
ungroup()
## Warning: `funs()` was deprecated in dplyr 0.8.0.
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
p <- ggplot(data = by_country,
mapping = aes(x = donors_mean,
y = reorder(country, donors_mean)))
p + geom_point(size=3) +
facet_wrap(~ consent_law,scales = "free_y", ncol = 1) +
labs(x= "Donor Procurement Rate",
y= "")

library(ggrepel)
p <-  ggplot(data = by_country, mapping = aes(x = roads_mean,
y = donors_mean))
p + geom_text_repel(mapping = aes(label = country))

elections_historic %>% select(2:7)
## # A tibble: 49 × 6
##     year winner                 win_party ec_pct popular_pct popular_margin
##    <int> <chr>                  <chr>      <dbl>       <dbl>          <dbl>
##  1  1824 John Quincy Adams      D.-R.      0.322       0.309        -0.104 
##  2  1828 Andrew Jackson         Dem.       0.682       0.559         0.122 
##  3  1832 Andrew Jackson         Dem.       0.766       0.547         0.178 
##  4  1836 Martin Van Buren       Dem.       0.578       0.508         0.142 
##  5  1840 William Henry Harrison Whig       0.796       0.529         0.0605
##  6  1844 James Polk             Dem.       0.618       0.495         0.0145
##  7  1848 Zachary Taylor         Whig       0.562       0.473         0.0479
##  8  1852 Franklin Pierce        Dem.       0.858       0.508         0.0695
##  9  1856 James Buchanan         Dem.       0.588       0.453         0.122 
## 10  1860 Abraham Lincoln        Rep.       0.594       0.396         0.101 
## # … with 39 more rows
p_title <-  "Presidential Elections: Popular & Electoral College Margins"
p_subtitle <-  "1824-2016"
p_caption <-  "Data for 2016 are provisional."
x_label <-  "Winner's share of Popular Vote"
y_label <-  "Winner's share of Electoral College Votes"
p <-  ggplot(elections_historic, aes(x = popular_pct, y = ec_pct,
label = winner_label))
p + geom_hline(yintercept = 0.5, size = 1.4, color = "gray80") +
geom_vline(xintercept = 0.5, size = 1.4, color = "gray80") +
geom_point() +
geom_text_repel() +
scale_x_continuous(labels = scales::percent) +
scale_y_continuous(labels = scales::percent) +
labs(x = x_label, y = y_label, title = p_title, subtitle = p_subtitle,
caption = p_caption)

5.7 Where to Go Next

We covered several new functions and data aggregation techniques in this chapter. You should practice working with them.

The subset() function is very useful when used in conjunction with a series of layered geoms. Go back to your code for the presidential elections plot (fig. 5.18) and redo it so that it shows all the data points but only labels elections since 1992. You might need to look again at the elections_historic data to see what variables are available to you. You can also experiment with subsetting by political party, or changing the colors of the points to reflect the winning party.

# subset the elections data; year greater than 1988, since the next election was 1992 
after_1992 <- elections_historic %>% subset(year > 1988) 

# pre-setting labels as objects so we can use later 
p_title <-  "Presidential Elections: Popular & Electoral College Margins"
p_subtitle <-  "1824-2016"
p_caption <-  "Data for 2016 are provisional."
x_label <-  "Winner's share of Popular Vote"
y_label <-  "Winner's share of Electoral College Votes"

p <-  ggplot(data=elections_historic, aes(x = popular_pct, y = ec_pct,
label = winner_label, color=win_party, shape=win_party))

p + geom_hline(yintercept = 0.5, size = 1.4, color = "gray80") +
  geom_vline(xintercept = 0.5, size = 1.4, color = "gray80") +
  geom_point() + geom_smooth()+
  geom_text_repel(data = after_1992) +
  scale_x_continuous(labels = scales::percent) +
  scale_y_continuous(labels = scales::percent) + labs(x = x_label, y = y_label, title = p_title, subtitle = p_subtitle,caption = p_caption)

p <-  ggplot(data=elections_historic, aes(x = popular_pct, y = ec_pct,
label = winner_label, color=win_party, shape=win_party))
p + geom_hline(yintercept = 0.5, size = 1.4, color = "gray80") +
geom_vline(xintercept = 0.5, size = 1.4, color = "gray80") +
geom_point() +
geom_text_repel(data = after_1992) +
scale_x_continuous(labels = scales::percent) +
scale_y_continuous(labels = scales::percent) +
labs(x = x_label, y = y_label, title = p_title, subtitle = p_subtitle,
caption = p_caption)


Try using annotate() to add a rectangle that lightly colors the entire upper left quadrant of figure 5.18.

p_title <-  "Presidential Elections: Popular & Electoral College Margins"
p_subtitle <-  "1824-2016"
p_caption <-  "Data for 2016 are provisional."
x_label <-  "Winner's share of Popular Vote"
y_label <-  "Winner's share of Electoral College Votes"
p <-  ggplot(elections_historic, aes(x = popular_pct, y = ec_pct,
label = winner_label))
p + geom_hline(yintercept = 0.5, size = 1.4, color = "gray80") +
geom_vline(xintercept = 0.5, size = 1.4, color = "gray80") +
geom_point() +
geom_text_repel() +
scale_x_continuous(labels = scales::percent) +
scale_y_continuous(labels = scales::percent) +
labs(x = x_label, y = y_label, title = p_title, subtitle = p_subtitle,
caption = p_caption)+ 
  annotate("rect", xmin = -Inf, xmax = 0.5, ymin = 0.5, ymax = Inf, fill = "red", alpha = 0.1)

The main action verbs in the dplyr library are group_by(), filter(), select(), summarize(), and mutate(). Practice with them by revisiting the gapminder data to see if you can reproduce a pair of graphs from chapter 1, shown here again in figure 5.28. You will need to filter some rows, group the data by continent, and calculate the mean life expectancy by continent before beginning the plotting process.

library(ggthemes) # used below for the minimal theme 

# first, using mutate to calculate avg. life expectancy 
gapminder %>% 
  filter(year == 2007) %>% 
  select(continent, lifeExp) %>% 
  group_by(continent) %>% 
  mutate(continent_mean = mean(lifeExp)) %>% 
  ggplot(aes(x = continent_mean, y = reorder(continent, continent_mean))) + geom_point() + 
  labs(title = "Mean Life Expectancy in 2007 by Continent", x = "Life Expectancy (Years)", 
       y = "Continent") + theme_minimal() 

# same plot as above but using summarize instead of mutate to calculate means 
gapminder %>% 
  filter(year == 2007) %>% 
  select(continent, lifeExp) %>% group_by(continent) %>% 
  summarize(exp_mean = mean(lifeExp)) %>%
  ggplot(aes(x = exp_mean, y = reorder(continent, exp_mean))) + geom_point() + 
  labs(title = "Mean Life Expectancy in 2007 by Continent", x = "Life Expectancy (Years)", 
       y = "Continent") + theme_minimal() 

Get comfortable with grouping, mutating, and summarizing data in pipelines. This will become a routine task as you work with your data. There are many ways that tables can be aggregated and transformed. Remember, group_by() groups your data from left to right, with the rightmost or innermost group being the level calculations will be done at; mutate() adds a column at the current level of grouping; and summarize() aggregates to the next level up. Try creating some grouped objects from the GSS data, calculating frequencies as you learned in this chapter, and then check to see if the totals are what you expect. For example, start by grouping degree by race, like this:

# Healy example: 
gss_sm %>% group_by(race, degree) %>%
    summarize(N = n()) %>%
    mutate(pct = round(N / sum(N)*100, 0)) 
## # A tibble: 18 × 4
## # Groups:   race [3]
##    race  degree             N   pct
##    <fct> <fct>          <int> <dbl>
##  1 White Lt High School   197     9
##  2 White High School     1057    50
##  3 White Junior College   166     8
##  4 White Bachelor         426    20
##  5 White Graduate         250    12
##  6 White <NA>               4     0
##  7 Black Lt High School    60    12
##  8 Black High School      292    60
##  9 Black Junior College    33     7
## 10 Black Bachelor          71    14
## 11 Black Graduate          31     6
## 12 Black <NA>               3     1
## 13 Other Lt High School    71    26
## 14 Other High School      112    40
## 15 Other Junior College    17     6
## 16 Other Bachelor          39    14
## 17 Other Graduate          37    13
## 18 Other <NA>               1     0
gss_sm %>% group_by(race, degree) %>% summarize(N = n()) %>%
mutate(pct = round(N/sum(N) * 100, 0)) %>% summarize(total = sum(pct))
## `summarise()` has grouped output by 'race'. You can override using the `.groups`
## argument.
## # A tibble: 3 × 2
##   race  total
##   <fct> <dbl>
## 1 White    99
## 2 Black   100
## 3 Other    99

As expected, this sums to 100 within each race, with a bit of rounding error.
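The rounding error is easy to reproduce in isolation. A minimal sketch with made-up frequencies:

```r
# three equal-sized groups: each is 33.33...%, which rounds down to 33
pct <- round(c(1, 1, 1) / 3 * 100, 0)
pct        # 33 33 33
sum(pct)   # 99, not 100
```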

This code is similar to what you saw earlier but more compact. (We calculate the pct values directly.) Check that the results are as you expect by grouping by race and summing the percentages. Try doing the same exercise grouping by sex or region.

gss_sm %>% group_by(region, degree) %>% summarize(N = n()) %>%
mutate(pct = round(N/sum(N) * 100, 0)) 
## `summarise()` has grouped output by 'region'. You can override using the
## `.groups` argument.
## # A tibble: 49 × 4
## # Groups:   region [9]
##    region          degree             N   pct
##    <fct>           <fct>          <int> <dbl>
##  1 New England     Lt High School    13     7
##  2 New England     High School       77    44
##  3 New England     Junior College    19    11
##  4 New England     Bachelor          39    22
##  5 New England     Graduate          27    15
##  6 Middle Atlantic Lt High School    25     8
##  7 Middle Atlantic High School      160    51
##  8 Middle Atlantic Junior College    21     7
##  9 Middle Atlantic Bachelor          59    19
## 10 Middle Atlantic Graduate          47    15
## # … with 39 more rows
gss_sm %>% group_by(region, degree) %>% summarize(N = n()) %>%
mutate(pct = round(N/sum(N) * 100, 0))%>% summarize(total = sum(pct))
## `summarise()` has grouped output by 'region'. You can override using the
## `.groups` argument.
## # A tibble: 9 × 2
##   region          total
##   <fct>           <dbl>
## 1 New England        99
## 2 Middle Atlantic   100
## 3 E. Nor. Central   100
## 4 W. Nor. Central   100
## 5 South Atlantic    100
## 6 E. Sou. Central   100
## 7 W. Sou. Central   100
## 8 Mountain          100
## 9 Pacific           100

As expected, this sums to 100 within each region, with a bit of rounding error.

Try summary calculations with functions other than sum. Can you calculate the mean and median number of children by degree? (Hint: the childs variable in gss_sm has children as a numeric value.)

gss_sm %>% group_by(degree) %>% 
  summarize(childsmean=mean(childs, na.rm=TRUE), 
            childsmedian=median(childs, na.rm=TRUE))
## # A tibble: 6 × 3
##   degree         childsmean childsmedian
##   <fct>               <dbl>        <dbl>
## 1 Lt High School       2.81            3
## 2 High School          1.86            2
## 3 Junior College       1.77            2
## 4 Bachelor             1.45            1
## 5 Graduate             1.52            2
## 6 <NA>                 3.6             4
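As a cross-check on grouped summaries like this, base R's tapply() performs the same aggregation without dplyr; a sketch with made-up values (not the real gss_sm data):

```r
# hypothetical toy data standing in for gss_sm$childs and gss_sm$degree
childs <- c(3, 2, 2, 1, 0, 4)
degree <- c("High School", "High School", "Bachelor", "Bachelor",
            "Graduate", "Lt High School")

# mean number of children within each degree group
tapply(childs, degree, mean, na.rm = TRUE)
```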

dplyr has a large number of helper functions that let you summarize data in many different ways. The vignette on window functions included with the dplyr documentation is a good place to begin learning about these. You should also look at chapter 3 of Wickham & Grolemund (2016) for more information on transforming data with dplyr.

Experiment with the gapminder data to practice some of the new geoms we have learned. Try examining population or life expectancy over time using a series of boxplots. (Hint: you may need to use the group aesthetic in the aes() call.) Can you facet this boxplot by continent? Is anything different if you create a tibble from gapminder that explicitly groups the data by year and continent first, and then create your plots with that?

# boxplot with life expectancy over time (years)
ggplot(gapminder, aes(x = year, y = lifeExp, group = year)) + 
  geom_boxplot() 

# same as above, but first explicitly grouping by continent and year, then faceting by continent to show life expectancy distributions within each continent across years 
gapminder %>% 
  group_by(continent, year) %>% 
  ggplot(aes(x = year, y = lifeExp, group = year)) + geom_boxplot() + facet_wrap(~continent) 

Read the help page for geom_boxplot() and take a look at the notch and varwidth options. Try them out to see how they change the look of the plot.

#?geom_boxplot() # view geom_boxplot() documentation 

# notched boxplot 
gapminder %>% 
  filter(continent == "Europe") %>% 
  ggplot(aes(x  = year, y = lifeExp, group = year)) + geom_boxplot(notch = TRUE)

# adjust varwidth options 
gapminder %>% 
  filter(continent == "Europe") %>% 
  ggplot(aes(x  = year, y = lifeExp, group = year)) + geom_boxplot(varwidth = TRUE)

As an alternative to geom_boxplot(), try geom_violin() for a similar plot but with a mirrored density distribution instead of a box and whiskers.

# violin plot 1 
ggplot(gapminder, aes(x=year, y=lifeExp, group= year, fill=continent)) + 
  geom_violin() + facet_wrap(~continent) 

# violin plot 2

gapminder %>%
  filter(year %in% c(1957, 2007), continent != "Oceania") %>% 
  ggplot(aes(x = year, y = lifeExp, group = year, fill = factor(year))) + geom_violin() + facet_wrap(~continent) + 
  theme(legend.position = "none") + labs(title = "Life Expectancy in 1957 and 2007 by Continent", 
                                         x = "Year", y = "Life Expectancy") 

Bulut & Desjardins - Chapter 5

Visualizing Big Data

Load relevant packages

library("ggplot2")
library("GGally")
library("ggExtra")
library("ggalluvial")
library("plotly")
library("cowplot")
library("data.table")

First, we load random6.csv, a random subset of the pisa data set (the larger region6.csv regional subset can be swapped in below). Additionally, before answering the questions in the chapter, we repeat the data-cleaning steps applied to the data set in Ch. 4 to ready the data for visualization.

pisa <- fread("random6.csv", na.strings = "") 
#pisa <- fread("region6.csv",na.strings = "")

# check data
#head(pisa)  # not printed in the HTML file as it returns a large output 

Recode variables

bin.to.num <- function(x) {
  if (is.na(x)) NA
  else if (x == "Yes") 1L
  else if (x == "No") 0L
}
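A quick sanity check of the recoding function on the three values it is written for (it handles one scalar at a time, hence the sapply() calls below; the definition is restated so the snippet runs on its own):

```r
# same definition as above, repeated so this snippet is self-contained
bin.to.num <- function(x) {
  if (is.na(x)) NA
  else if (x == "Yes") 1L
  else if (x == "No") 0L
}

sapply(c("Yes", "No", NA), bin.to.num)   # 1, 0, NA
```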


#Create and recode variables
pisa[, `:=` 
     (female = ifelse(ST004D01T == "Female", 1, 0),
       sex = ST004D01T,
       
       # At my house we have ...
       desk = sapply(ST011Q01TA, bin.to.num),
       own.room = sapply(ST011Q02TA, bin.to.num),
       quiet.study = sapply(ST011Q03TA, bin.to.num),
       computer = sapply(ST011Q04TA, bin.to.num),
       software = sapply(ST011Q05TA, bin.to.num),
       internet = sapply(ST011Q06TA, bin.to.num),
       lit = sapply(ST011Q07TA, bin.to.num),
       poetry = sapply(ST011Q08TA, bin.to.num),
       art = sapply(ST011Q09TA, bin.to.num),
       book.sch = sapply(ST011Q10TA, bin.to.num),
       tech.book = sapply(ST011Q11TA, bin.to.num),
       dict = sapply(ST011Q12TA, bin.to.num),
       art.book = sapply(ST011Q16NA, bin.to.num))]

pisa[, `:=`
     (math = rowMeans(pisa[, c(paste0("PV", 1:10, "MATH"))], na.rm = TRUE),
      reading = rowMeans(pisa[, c(paste0("PV", 1:10, "READ"))], na.rm = TRUE),
      science = rowMeans(pisa[, c(paste0("PV", 1:10, "SCIE"))], na.rm = TRUE))]

Filter big data based on a list of countries (called “country”)

country <- c("United States", "Canada", "Mexico", "B-S-J-G (China)", "Japan",
             "Korea", "Germany", "Italy", "France", "Brazil", "Colombia", "Uruguay",
             "Australia", "New Zealand", "Jordan", "Israel", "Lebanon")

dat <- pisa[CNT %in% country,
            .(CNT, OECD, CNTSTUID, W_FSTUWT, sex, female,
              ST001D01T, computer, software, internet,
              ST011Q05TA, ST071Q02NA, ST071Q01NA, ST123Q02NA,
              ST082Q01NA, ST119Q01NA, ST119Q05NA, ANXTEST,
              COOPERATE, BELONG,  EMOSUPS, HOMESCH, ENTUSE,
              ICTHOME, ICTSCH, WEALTH, PARED, TMINS, ESCS,
              TEACHSUP, TDTEACH, IBTEACH, SCIEEFF,
              math, reading, science)
            ]

Create additional variables for data visualization

dat <- dat[, `:=` (
  # New grade variable
  grade = (as.numeric(sapply(ST001D01T, function(x) {
  if(x=="Grade 7") "7"
  else if (x=="Grade 8") "8"
  else if (x=="Grade 9") "9"
  else if (x=="Grade 10") "10"
  else if (x=="Grade 11") "11"
  else if (x=="Grade 12") "12"
  else if (x=="Grade 13") NA_character_
  else if (x=="Ungraded") NA_character_}))),
  # Total learning time as hours
  learning = round(TMINS/60, 0),
  # Regions for selected countries
  Region = (sapply(CNT, function(x) {
  if(x %in% c("Canada", "United States", "Mexico")) "N. America"
    else if (x %in% c("Colombia", "Brazil", "Uruguay")) "S. America"
    else if (x %in% c("Japan", "B-S-J-G (China)", "Korea")) "Asia"
    else if (x %in% c("Germany", "Italy", "France")) "Europe"
    else if (x %in% c("Australia", "New Zealand")) "Australia"
    else if (x %in% c("Israel", "Jordan", "Lebanon")) "Middle-East"
    }))
  )]
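As an aside, the nested if/else chains inside sapply() above can be replaced with a named lookup vector, which is vectorized and easier to extend. A sketch with a few of the countries from the list (mapping copied from the code above):

```r
# country-to-region mapping as a named character vector
region_lookup <- c("Canada"    = "N. America",
                   "Brazil"    = "S. America",
                   "Japan"     = "Asia",
                   "Germany"   = "Europe",
                   "Australia" = "Australia",
                   "Jordan"    = "Middle-East")

# a single vectorized indexing step instead of sapply over every row
unname(region_lookup[c("Japan", "Canada", "Jordan")])
```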

# N count for the final dataset
dat[,.N]
## [1] 35847
# (158,061 rows when run on the larger region6.csv subset instead)

# Preview final data
head(dat)
##        CNT OECD CNTSTUID  W_FSTUWT    sex female ST001D01T computer software
## 1: Germany  Yes 27605365  94.71886 Female      1   Grade 9        1        1
## 2: Germany  Yes 27601858  94.71886 Female      1   Grade 9        1        1
## 3: Germany  Yes 27602008  94.71886 Female      1   Grade 9        1        0
## 4: Germany  Yes 27601867 101.08900   Male      0   Grade 9        1        1
## 5: Germany  Yes 27601457 101.08900   Male      0   Grade 9        1        1
## 6: Germany  Yes 27609865 101.08900   Male      0   Grade 9        1        1
##    internet ST011Q05TA ST071Q02NA ST071Q01NA     ST123Q02NA     ST082Q01NA
## 1:        1        Yes          2          1          Agree          Agree
## 2:        1        Yes          6          4 Strongly agree          Agree
## 3:        1         No          1          1 Strongly agree          Agree
## 4:        1        Yes          1          1 Strongly agree Strongly agree
## 5:        1        Yes          1          0 Strongly agree Strongly agree
## 6:        1        Yes          7          1          Agree Strongly agree
##           ST119Q01NA        ST119Q05NA ANXTEST COOPERATE  BELONG EMOSUPS
## 1: Strongly disagree Strongly disagree -0.0345   -0.2882  0.0171 -1.5266
## 2:          Disagree          Disagree -0.5792    1.0205  0.6642  1.0991
## 3:    Strongly agree             Agree -0.4331    2.2879  1.5043  1.0991
## 4:             Agree          Disagree  1.8135    0.9675  0.0048  1.0991
## 5:             Agree             Agree  0.6007    0.7898 -1.7499  1.0991
## 6:    Strongly agree             Agree -0.0582    0.6615  0.9357  0.0250
##    HOMESCH  ENTUSE ICTHOME ICTSCH  WEALTH PARED TMINS    ESCS TEACHSUP TDTEACH
## 1: -0.5712 -0.4097      NA     NA  1.0228    15  1350  0.3306  -0.8040 -0.8671
## 2: -0.8913 -0.7527      NA     NA -0.2249    13  1620 -0.4779   0.4879 -0.6849
## 3:  0.3373  0.4811      NA     NA  1.3528    18    NA  0.9317   1.4475  0.5252
## 4:  0.0631 -0.1772      NA     NA  2.3511    13  1620  0.6614   1.4475  0.9477
## 5:  0.1053 -0.0269      NA     NA  0.3228    10  1620 -0.1385   0.0838  2.0781
## 6: -0.2290  0.3594      NA     NA  0.2763    15  1620  0.2320  -0.3478  0.4505
##    IBTEACH SCIEEFF     math  reading  science grade learning Region
## 1: -0.6081      NA 487.5435 529.1613 502.8598     9       22 Europe
## 2: -0.1573 -2.8303 464.7986 516.5613 487.9087     9       27 Europe
## 3:  0.9882 -2.0922 595.8506 657.5961 579.7019     9       NA Europe
## 4:  2.4937 -0.4044 577.6913 546.5353 553.7394     9       27 Europe
## 5:  2.3047  0.1155 519.0156 471.8701 467.8655     9       27 Europe
## 6:  0.5711  0.3155 541.2141 540.0169 538.2691     9       27 Europe

Exercise 5.2.1

A. Create a plot of math scores over countries with different colors based on region. How long did it take to create both plots? Which one is a better way to visualize this type of data?

  • Both plots were easy to generate and did not take long to render
  • The second plot is a better way to visualize this type of data
  • While the first plot gives a clear view of the spread of scores in each country relative to the average score, the second also shows the density of that spread within each country

ggplot(data = dat,
       mapping = aes(x = reorder(CNT, math), y = math, fill = Region)) +
  geom_point(aes(color = Region)) +
  labs(x=NULL, y="Math Scores") +
  coord_flip() +
  geom_hline(yintercept = 490, linetype="dashed", color = "red", size = 1) +
  theme_bw()

ggplot(data = dat,
       mapping = aes(x = reorder(CNT, math), y = math, fill = Region)) +
 geom_violin(aes(color = Region))+
  labs(x=NULL, y="Math Scores") +
  coord_flip() +
  geom_hline(yintercept = 490, linetype="dashed", color = "red", size = 1) +
  theme_bw()

  • Creating histograms for a particular variable and splitting the plot into multiple plots by using a categorical grouping variable (gender)

ggplot(data = dat,
       mapping = aes(x = science, fill = Region)) +
  geom_histogram(alpha = 0.5, bins = 50) +
  labs(x = "Science Scores", y = "Count",
       title = "Science Scores by Gender and Region") +
  facet_grid(. ~ sex) +
  theme_bw()

# Random sample of 500 students from each region
dat_small <- dat[,.SD[sample(.N, min(500,.N))], by = Region]
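The .SD idiom above draws up to 500 rows per Region. The same stratified sampling can be sketched in base R with split() and lapply() (toy data and a smaller cap, purely for illustration):

```r
set.seed(42)
toy <- data.frame(Region = rep(c("Asia", "Europe"), times = c(10, 3)),
                  score  = rnorm(13))

# take up to 5 rows per region, mirroring dat[, .SD[sample(.N, min(500, .N))], by = Region]
take <- function(g) g[sample(nrow(g), min(5, nrow(g))), , drop = FALSE]
toy_small <- do.call(rbind, lapply(split(toy, toy$Region), take))
table(toy_small$Region)   # Asia: 5, Europe: 3
```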

ggpairs(data = dat_small,
        mapping = aes(color = Region),
        columns = c("reading", "science", "math", "sex"),
        upper = list(continuous = wrap("cor", size = 2.5))
        )

Visualizing multiple variables

Interpretation:

  • What can we say about the regions based on the plots below?
    • First, in the ggpairs visualization, we see that across all regions, there is a strong positive correlation between students’ science, reading, and math scores.
    • In addition, while difficult to distinguish between regions in the scatter plots in the ggpairs chart above, we see the same strong positive correlations visually indicated.
  • Do you see any major gender differences for reading, science, or math?
    • Visual inspection of the boxplots in the ggpairs visualization revealed that the median math scores for male students were greater in almost all of the world regions in this data set. For reading, female student reading scores were greater than male student scores in Europe, Australia, and Asia. That being said, these differences appear small via visual inspection and further statistical analysis would be necessary to support these conclusions.
  • What is the relationship among reading, science, or math?
    • As discussed above, the relationships among reading, science, and math are all strong and positive (r = 0.89 to r = 0.92).

Exercise 5.3.1

Create a scatter plot of socioeconomic status (ESCS) and math scores (math) across regions (Region) and gender (sex).

ggplot(data = dat_small,
       mapping = aes(x = ESCS, y = math)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(x = "Socio-Economic Status", y = "Math Scores") +
  theme_bw() +
  theme(legend.title = element_blank()) +
  facet_grid(sex ~ Region)

Do you think that the relationship between ESCS and math changes across gender and regions?

  • Overall, across all regions and both genders, socio-economic status holds a positive relationship with Math scores. Specifically, higher math scores are associated with higher levels of SES across regions and genders.
  • That being said, the relationship does appear to differ across regions. For example, the relationship between math scores and SES is stronger in Asia than in N. America, so region likely moderates the relationship between math and SES; yes, we believe it changes across regions.
  • Additionally, mean math scores are greater in some regions than in others. For example, the mean math score in Asia appears to be larger than that of North America.
  • For gender, it is less clear if the relationship between math and SES changes by gender. Visual inspection does not reveal noticeable patterns of differences between male and female students on the relationship between math and SES.
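One way to move beyond visual inspection is to fit a model with an ESCS-by-region interaction: a significant interaction term would support the moderation claim. The sketch below uses simulated data with the same variable names as dat_small; with the real data, the model call would be `lm(math ~ ESCS * Region, data = dat_small)`:

```r
# Sketch: testing whether Region moderates the ESCS-math relationship.
# Synthetic data with a deliberately steeper ESCS slope in Asia.
set.seed(42)
n <- 300
Region <- factor(sample(c("Asia", "N. America"), n, replace = TRUE))
ESCS <- rnorm(n)
slope <- ifelse(Region == "Asia", 40, 20)  # simulated regional slopes
math <- 470 + slope * ESCS + rnorm(n, sd = 30)

fit <- lm(math ~ ESCS * Region)
summary(fit)$coefficients  # the ESCS:Region row tests the moderation
```

A gender moderation check works the same way, with `sex` in place of `Region` (or a three-way `ESCS * Region * sex` model).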

Exercise 5.6.1

Create an alluvial plot for the survey item (ST123Q02NA) of parent support for educational achievement by region (Region) and gender (sex).

Below we create the summary data set (dat_alluvial2) for this plot.

# count responses by region, gender, and survey item; blank responses are
# recoded to "Missing" in a helper factor column
dat_alluvial2 <- dat[,
                     .(Freq = .N),
                     by = c("Region", "sex", "ST123Q02NA")
                     ][,
                       ST119Q01NA := as.factor(ifelse(ST123Q02NA == "", "Missing", ST123Q02NA))]

levels(dat_alluvial2$ST123Q02NA) <- c("Strongly disagree", "Disagree", "Agree",
                                      "Strongly agree", "Missing")

head(dat_alluvial2)
##    Region    sex     ST123Q02NA Freq     ST119Q01NA
## 1: Europe Female          Agree  822          Agree
## 2: Europe Female Strongly agree 1775 Strongly agree
## 3: Europe   Male Strongly agree 1645 Strongly agree
## 4: Europe   Male          Agree  910          Agree
## 5: Europe Female       Disagree  183       Disagree
## 6: Europe   Male           <NA>  546           <NA>
dat_alluvial2 %>% arrange(desc(Freq)) 
##          Region    sex        ST123Q02NA Freq        ST119Q01NA
##  1: Middle-East Female              <NA> 2451              <NA>
##  2:  N. America Female    Strongly agree 2312    Strongly agree
##  3:  N. America   Male    Strongly agree 2113    Strongly agree
##  4: Middle-East   Male              <NA> 2095              <NA>
##  5:  S. America Female    Strongly agree 1883    Strongly agree
##  6:      Europe Female    Strongly agree 1775    Strongly agree
##  7:        Asia   Male             Agree 1742             Agree
##  8:  S. America   Male    Strongly agree 1716    Strongly agree
##  9:      Europe   Male    Strongly agree 1645    Strongly agree
## 10:        Asia Female    Strongly agree 1582    Strongly agree
## 11:        Asia Female             Agree 1467             Agree
## 12:   Australia Female    Strongly agree 1446    Strongly agree
## 13:   Australia   Male    Strongly agree 1372    Strongly agree
## 14:  N. America   Male             Agree 1223             Agree
## 15:        Asia   Male    Strongly agree 1192    Strongly agree
## 16:  N. America Female             Agree 1115             Agree
## 17:  S. America Female             Agree  976             Agree
## 18:      Europe   Male             Agree  910             Agree
## 19:  S. America   Male             Agree  892             Agree
## 20:      Europe Female             Agree  822             Agree
## 21:   Australia   Male             Agree  709             Agree
## 22:   Australia Female             Agree  649             Agree
## 23:      Europe   Male              <NA>  546              <NA>
## 24:      Europe Female              <NA>  376              <NA>
## 25:        Asia   Male          Disagree  256          Disagree
## 26:  N. America   Male Strongly disagree  238 Strongly disagree
## 27:  N. America Female Strongly disagree  229 Strongly disagree
## 28:        Asia Female          Disagree  198          Disagree
## 29:      Europe Female          Disagree  183          Disagree
## 30:      Europe   Male          Disagree  175          Disagree
## 31:  N. America   Male          Disagree  139          Disagree
## 32:  S. America   Male              <NA>  131              <NA>
## 33:  N. America Female          Disagree  119          Disagree
## 34:        Asia   Male Strongly disagree  111 Strongly disagree
## 35:  S. America Female          Disagree  108          Disagree
## 36:  S. America Female              <NA>  108              <NA>
## 37:  S. America   Male          Disagree   90          Disagree
## 38:   Australia   Male              <NA>   82              <NA>
## 39:  S. America Female Strongly disagree   79 Strongly disagree
## 40:  S. America   Male Strongly disagree   79 Strongly disagree
## 41:   Australia   Male Strongly disagree   62 Strongly disagree
## 42:   Australia Female Strongly disagree   55 Strongly disagree
## 43:   Australia Female              <NA>   54              <NA>
## 44:   Australia Female          Disagree   53          Disagree
## 45:  N. America   Male              <NA>   52              <NA>
## 46:        Asia Female Strongly disagree   46 Strongly disagree
## 47:      Europe Female Strongly disagree   41 Strongly disagree
## 48:   Australia   Male          Disagree   38          Disagree
## 49:        Asia   Male              <NA>   37              <NA>
## 50:      Europe   Male Strongly disagree   31 Strongly disagree
## 51:  N. America Female              <NA>   28              <NA>
## 52:        Asia Female              <NA>   16              <NA>
##          Region    sex        ST123Q02NA Freq        ST119Q01NA
# clean column names, convert grouping variables to factors, and drop missing responses
dat_alluvial2 <- dat_alluvial2 %>% 
  rename(parent_support = ST123Q02NA) %>%
  mutate(sex = as.factor(sex), 
         Region = as.factor(Region), 
         parent_support = as.factor(parent_support)) %>% 
  filter(!is.na(parent_support)) 

dat_alluvial2
##         Region    sex    parent_support Freq        ST119Q01NA
##  1:     Europe Female             Agree  822             Agree
##  2:     Europe Female    Strongly agree 1775    Strongly agree
##  3:     Europe   Male    Strongly agree 1645    Strongly agree
##  4:     Europe   Male             Agree  910             Agree
##  5:     Europe Female          Disagree  183          Disagree
##  6:     Europe   Male          Disagree  175          Disagree
##  7:     Europe Female Strongly disagree   41 Strongly disagree
##  8:     Europe   Male Strongly disagree   31 Strongly disagree
##  9:       Asia Female    Strongly agree 1582    Strongly agree
## 10:       Asia Female             Agree 1467             Agree
## 11:       Asia   Male             Agree 1742             Agree
## 12:       Asia   Male          Disagree  256          Disagree
## 13:       Asia   Male    Strongly agree 1192    Strongly agree
## 14:       Asia Female Strongly disagree   46 Strongly disagree
## 15:       Asia Female          Disagree  198          Disagree
## 16:       Asia   Male Strongly disagree  111 Strongly disagree
## 17: N. America   Male             Agree 1223             Agree
## 18: N. America   Male    Strongly agree 2113    Strongly agree
## 19: N. America Female             Agree 1115             Agree
## 20: N. America   Male Strongly disagree  238 Strongly disagree
## 21: N. America Female    Strongly agree 2312    Strongly agree
## 22: N. America Female          Disagree  119          Disagree
## 23: N. America   Male          Disagree  139          Disagree
## 24: N. America Female Strongly disagree  229 Strongly disagree
## 25:  Australia Female    Strongly agree 1446    Strongly agree
## 26:  Australia   Male             Agree  709             Agree
## 27:  Australia Female             Agree  649             Agree
## 28:  Australia Female          Disagree   53          Disagree
## 29:  Australia   Male    Strongly agree 1372    Strongly agree
## 30:  Australia   Male          Disagree   38          Disagree
## 31:  Australia   Male Strongly disagree   62 Strongly disagree
## 32:  Australia Female Strongly disagree   55 Strongly disagree
## 33: S. America Female             Agree  976             Agree
## 34: S. America Female    Strongly agree 1883    Strongly agree
## 35: S. America   Male    Strongly agree 1716    Strongly agree
## 36: S. America   Male             Agree  892             Agree
## 37: S. America Female Strongly disagree   79 Strongly disagree
## 38: S. America   Male          Disagree   90          Disagree
## 39: S. America Female          Disagree  108          Disagree
## 40: S. America   Male Strongly disagree   79 Strongly disagree
##         Region    sex    parent_support Freq        ST119Q01NA

Next, we use this data set to draw the alluvial plot.


library(ggalluvial) # provides geom_alluvium(), geom_stratum(), and stat = "stratum"

ggplot(data = dat_alluvial2,
       aes(axis1 = Region, axis2 = parent_support, y = Freq, label = after_stat(stratum))) +
  scale_x_discrete(limits = c("Region", "Parents supporting\nachievement"),
                   expand = c(.1, .05)) +
  geom_alluvium(aes(fill = sex)) +
  geom_stratum() +
  geom_text(stat = "stratum") + # labels come from after_stat(stratum) in aes()
  labs(x = "Demographics", y = "Frequency", fill = "Gender") +
  theme_bw()

How should we interpret the plot (e.g., for each region)?

There are multiple trends apparent from the alluvial plot. First, there is variation within every region: each region’s students report a mix of parental-support levels. Second, comparing regions, some regions report, on average, a higher level of parental support than others. For example, the region at the bottom of the plot shows, by and large, a lower level of parental support than the region at the top. Finally, while parental support varies within each region, the proportion of each level of parental support remains roughly consistent between genders within each region.
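The “proportions consistent between genders” claim can also be checked numerically. The sketch below uses the Europe frequencies printed in the table above and computes within-gender proportions for each support level:

```r
# Sketch: within-gender proportions of parental support for Europe,
# using the Freq values from the dat_alluvial2 table above.
library(dplyr)

europe <- data.frame(
  sex = rep(c("Female", "Male"), each = 4),
  parent_support = rep(c("Strongly agree", "Agree",
                         "Disagree", "Strongly disagree"), 2),
  Freq = c(1775, 822, 183, 41,
           1645, 910, 175, 31)
)

europe %>%
  group_by(sex) %>%
  mutate(prop = Freq / sum(Freq)) %>%  # share of each support level by gender
  arrange(parent_support, sex)
```

For Europe, for instance, the “Strongly agree” share is about 0.63 for female students and 0.60 for male students, consistent with the visual impression of similar within-region proportions across genders.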

Exercise 5.7.1

Replicate the science-by-region histogram below as a density plot and use plotly to make it interactive.

Science-by-Region Histogram:

ggplot(data = dat,
       mapping = aes(x = science, fill = Region)) +
  geom_histogram(alpha = 0.5, bins = 50) +
  labs(x = "Science Scores", y = "Count",
       title = "Science Scores by Gender and Region") +
  facet_grid(. ~ sex) +
  theme_bw()

Next, we present a science-by-region density plot made interactive using plotly, with alpha = 0.5:

p5 <- ggplot(data = dat,
       mapping = aes(x = science, fill = Region)) +
  geom_density(alpha = 0.5) +
  labs(x = "Science Scores", y = "Density",
       title = "Science Scores by Gender and Region") +
  facet_grid(. ~ sex) +
  theme_bw()

ggplotly(p5)

The same science-by-region density plot, made interactive using plotly, with alpha = 0.8:

p6 <- ggplot(data = dat,
       mapping = aes(x = science, fill = Region)) +
  geom_density(alpha = 0.8) +
  labs(x = "Science Scores", y = "Density",
       title = "Science Scores by Gender and Region") +
  facet_grid(. ~ sex) +
  theme_bw()

ggplotly(p6)

Which version is better for examining the science score distribution?

  • The alpha = 0.5 version is better for reviewing the science score distribution: the added transparency gives a clearer view of the full distribution for each region.
  • In contrast, the opacity of the alpha = 0.8 plot conceals much of the distributions drawn behind the front-most regions. While the interactive elements of the plotly viz are engaging, it is more difficult to compare the regional distributions in this chart than in the first.
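A third option that sidesteps the alpha trade-off entirely is to map Region to the outline colour and drop the fill, so no region can occlude another. A minimal sketch, shown here with a small synthetic stand-in for `dat` (with the real data, the ggplot call is the only part needed):

```r
# Sketch: unfilled density outlines avoid occlusion regardless of alpha.
library(ggplot2)

set.seed(1)
dat_demo <- data.frame(
  science = c(rnorm(300, 460, 90), rnorm(300, 500, 80)),
  Region  = rep(c("Asia", "Europe"), each = 300),
  sex     = sample(c("Female", "Male"), 600, replace = TRUE)
)

p_outline <- ggplot(dat_demo, aes(x = science, colour = Region)) +
  geom_density(fill = NA, linewidth = 0.8) +  # outlines only, no fill
  labs(x = "Science Scores", y = "Density",
       title = "Science Scores by Gender and Region") +
  facet_grid(. ~ sex) +
  theme_bw()
```

This version can be passed to `ggplotly()` in the same way as the filled plots above.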

Exercise 5.9

  • Plot 1: Reading Scores by Gender and Region
ggplot(data = dat,
       mapping = aes(x = reading, fill = Region)) +
  geom_histogram(alpha = 0.5, bins = 50) +
  labs(x = "Reading Scores", y = "Count",
       title = "Reading Scores by Gender and Region") +
  facet_grid(. ~ sex) +
  theme_bw()

  • Plot 2: Interactive Density Plot of Reading Scores by Gender and Region
p8 <- ggplot(data = dat,
       mapping = aes(x = reading, fill = Region)) +
  geom_density(alpha = 0.5) +
  labs(x = "Reading Scores", y = "Density",
       title = "Reading Scores by Gender and Region") +
  facet_grid(. ~ sex) +
  theme_bw()

ggplotly(p8)

Tidy Tuesday Challenge

In a Tidy Tuesday screencast originally presented in February 2021, David Robinson explored an open-source data set from the National Science Foundation (NSF) that contains the number of PhDs awarded each year by academic discipline. For this assignment, we revised a visualization that David Robinson created during the screencast showing the number of PhDs awarded by field over time.

A. The original data visualization that inspired that week of Tidy Tuesday.

C. A paragraph on your plan to improve the data visualization.

We propose two main steps to improve David Robinson’s line graph of PhDs awarded by field over time. First, to enhance the line graph’s ability to convey differences in degrees awarded by field over time, we will apply principles of Gestalt psychology to the visualization. In particular, we plan to apply five principles, each of which is discussed in greater detail below in our section on the application of Gestalt theory. Second, to help the visualization convey a meaningful story, we decided to limit the date range presented and to focus on one academic discipline rather than all 7 broad fields. Our process for accomplishing each of these goals is detailed in section E below.

D. The code and your improved data visualization.

I. Data Cleaning

# rename field to 'minor field' 
phd_clean <- phd_df %>% 
  rename(minor_field = field) 

# store field as a factor below so that it's a group not character string and rename to match 
# David Robinson's naming convention from his Tidy Tuesday screencast 
phd_clean <- phd_clean %>% 
  mutate(field = as.factor(broad_field), 
         year = as.factor(year))  

# for our line plot, we will visualize the average degrees awarded each year by field 
phds_by_year <- phd_clean %>% 
  filter(!is.na(n_phds)) %>%
  group_by(year, field) %>% 
  summarize(num_degrees = mean(n_phds)) 
II. Improved visualization
library(ggtext) # used to successfully render HTML code for color words in title 

phds_by_year %>%
  mutate(highlight = ifelse(field == "Education", "Education", "Other Fields")) %>%
  ggplot(aes(x = year, y = num_degrees, group = field,
             color = highlight, size = highlight)) +
  geom_line() +
  scale_color_manual(values = c("#69b3a2", "grey50")) +
  scale_size_manual(values = c(1.5, 0.2)) +
  labs(
    title = "The number of PhDs in <span style='color:#69b3a2'>**education**</span> has decreased in recent years",
    x = "Year", y = "Degrees Awarded") +
  geom_label(x = 4, y = 185, label = "Education PhDs in decline",
             size = 4, color = "#69b3a2") +
  theme(
    legend.position = "none",
    plot.title = element_markdown(size = 14)) # ggtext renders the colored title

E. A few paragraphs on how the data visualization addresses the issues with Gestalt theory and the visual perception research.

As mentioned in our plan above, we applied five principles for enhancing data visualizations drawn from Schwabish (2021), whose work builds on insights from Gestalt theory and visual perception research. The following paragraphs state each principle that we used and discuss our process of applying it to enhance the original visualization.

The first two principles that we leveraged from Schwabish’s (2021) writing on Gestalt and visual perception research were to ‘show the data’ and ‘start with gray’. By showing the data, Schwabish recommends that analysts, when possible, limit the salience of data that is not relevant to the analysis in focus. For our Tidy Tuesday revision, while we enjoyed seeing Robinson’s trend lines for all academic disciplines, we felt that the chart bordered on a ‘spaghetti chart’: the visualization of 7 simultaneous trend lines placed too much cognitive demand on the audience. Thus, in applying the ‘show the data’ principle, we chose to highlight just one trend line, PhDs awarded in education. To do so, we isolated the education trend line through the application of the second principle, ‘starting with gray’. For this principle, Schwabish suggests that by making all chart elements gray at the outset of a project, the analyst becomes more strategic in applying color to the chart. This was true for our Tidy Tuesday revision: starting with gray forced us to restrict our use of color so that the education trend line would stand out.

The next principle from Schwabish (2021) that we applied was to integrate chart graphics and text. Specifically, continuing our selective use of color, we formatted the word ‘education’ in our chart title to match the color of the education trend line. This principle of integrating chart and text elements connects to Gestalt theorists’ work on similarity and pre-attentive processing. Gestalt theories of similarity draw on the general idea that humans tend to perceive elements that share similar characteristics as being related to one another. Building off of similarity, pre-attentive processing describes the unconscious cognitive processes that occur in response to patterns in the visual field. Analysts can leverage the Gestalt notion of similarity, say, to prime an audience to pre-attentively perceive connections between otherwise disparate elements in a visual field. For our project, the selective use of color in the chart title and trend line was a deliberate attempt to prime our audience to unconsciously perceive the connection between these features of our visualization.

Having integrated chart and text elements, we were then ready to reduce clutter, a fourth principle of Schwabish’s. Schwabish (2021) suggests that the unnecessary use of visualization elements can be distracting to readers. While the original visualization’s legend was necessary for identifying the academic disciplines in Robinson’s plot, we no longer needed a legend after applying the aforementioned principles. Finally, our last enhancement to the original Tidy Tuesday visualization was to apply a title written like a headline. Schwabish recommends that, as often as possible, titles for data visualizations be written in a way that underscores what a reader should learn from the visualization. A title focused on conveying a particular message minimizes the processing demands placed on the reader during the encoding phase of visual processing. Moreover, Schwabish cites research suggesting that such ‘active titles’ are associated with greater recall of the information conveyed through a data visualization. Thus, for our project, we crafted a title that spoke directly to the narrative we wished to highlight: that the number of education PhDs awarded in the U.S. has declined over the past decade.

F. A brief discussion of any major issues, snags, or hurdles you ran into and overcame during the process.

We experienced two issues during our revision of the visualization that we selected. Both occurred while applying Schwabish’s (2021) recommendation to integrate graph and text elements. First, we had difficulty coloring the single word ‘education’ in our title to match the trend line. We quickly learned that changing the color of specific title words requires the ggtext library. However, our biggest hang-up was learning how to format the HTML code: ggtext lets users write HTML in the title string to tell R which word to format and how. After some trial and error, we were able to color the word ‘education’ to match the trend line.

The second challenge that we ran into was positioning the chart annotation at the correct coordinates in the graph. Through early iterations, it was not immediately clear to us why we could not place the annotation at the correct spot in our chart space: the annotation was not showing up when we set the x-value to a year. We realized, however, that since we had previously re-formatted our year variable as a factor, we could not index the year column by the numeric value of the year. Instead, with a factor data type, ggplot expects the x-value to be the position of the level on the discrete axis (e.g. 1 through 10). Once we made this connection to the data type of our year variable, we were able to place the annotation at the desired spot on the chart.
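The factor-position issue above can be shown in a few lines. On a discrete axis, annotation x-coordinates refer to the integer position of the level, not the level’s label, so the position can be recovered with which() (the years below are illustrative stand-ins for our year factor):

```r
# Sketch: finding the discrete-axis position of a factor level for annotation.
year <- factor(c("2008", "2009", "2010", "2011", "2012"))

# wrong: geom_label(x = 2011, ...) looks for position 2011 on a 5-level axis
# right: find the level's position and use that instead
x_pos <- which(levels(year) == "2011")
x_pos  # 4, so the annotation call becomes geom_label(x = 4, ...) or x = x_pos
```

This is why the annotation in our improved visualization uses `x = 4` rather than a calendar year.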